Learning Mixtures of Gaussians using the k-means Algorithm
Authors
Abstract
One of the most popular algorithms for clustering in Euclidean space is the k-means algorithm; k-means is difficult to analyze mathematically, and few theoretical guarantees are known about it, particularly when the data is well-clustered. In this paper, we attempt to fill this gap in the literature by analyzing the behavior of k-means on well-clustered data. In particular, we study the case when each cluster is distributed as a different Gaussian – or, in other words, when the input comes from a mixture of Gaussians. We analyze three aspects of the k-means algorithm under this assumption. First, we show that when the input comes from a mixture of two spherical Gaussians, a variant of the 2-means algorithm successfully isolates the subspace containing the means of the mixture components. Second, we show an exact expression for the convergence of our variant of the 2-means algorithm, when the input is a very large number of samples from a mixture of spherical Gaussians. Our analysis does not require any lower bound on the separation between the mixture components. Finally, we study the sample requirement of k-means; for a mixture of 2 spherical Gaussians, we show an upper bound on the number of samples required by a variant of 2-means to get close to the true solution. The sample requirement grows with increasing dimensionality of the data, and decreasing separation between the means of the Gaussians. To match our upper bound, we show an information-theoretic lower bound on any algorithm that learns mixtures of two spherical Gaussians; our lower bound indicates that in the case when the overlap between the probability masses of the two distributions is small, the sample requirement of k-means is near-optimal.
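The setting described in the abstract can be illustrated with a small experiment. The sketch below is not the paper's variant of 2-means, which the abstract does not specify; it is plain Lloyd's iteration with k = 2, run on samples from an equal-weight mixture of two spherical Gaussians with means +mu and -mu. The function names, the equal mixing weights, the symmetric means, and all parameter values are assumptions made for illustration. It checks how well the line through the two learned centers aligns with the true direction between the means.

```python
import numpy as np

def sample_two_gaussians(n, mu, sigma=1.0, seed=0):
    """Draw n points from 0.5*N(mu, sigma^2 I) + 0.5*N(-mu, sigma^2 I).

    Equal weights and symmetric means are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)          # which component each point comes from
    noise = sigma * rng.standard_normal((n, mu.shape[0]))
    return signs[:, None] * mu + noise

def two_means(X, n_iter=50, seed=0):
    """Standard Lloyd's iteration with k = 2 (not the paper's variant)."""
    rng = np.random.default_rng(seed)
    # Initialise the two centers at two distinct data points.
    centers = X[rng.choice(len(X), size=2, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to each center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(2):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Illustrative parameters: 10 dimensions, means separated along the first axis.
mu = np.zeros(10)
mu[0] = 3.0
X = sample_two_gaussians(5000, mu)
centers, labels = two_means(X)

# Cosine of the angle between the learned direction and the true mean direction.
direction = centers[0] - centers[1]
cosine = abs(direction @ mu) / (np.linalg.norm(direction) * np.linalg.norm(mu))
print(f"alignment with true mean direction: {cosine:.4f}")
```

Reducing the separation (for example `mu[0] = 0.5`) or the sample size in this sketch gives a rough feel for the dependence on dimension and separation that the abstract describes, though it says nothing about the paper's actual bounds.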
Similar resources
PAC Learning Mixtures of Gaussians with No Separation Assumption
We propose and analyze a new vantage point for the learning of mixtures of Gaussians: namely, the PAC-style model of learning probability distributions introduced by Kearns et al. [12]. Here the task is to construct a hypothesis mixture of Gaussians that is statistically indistinguishable from the actual mixture generating the data; specifically, the KL divergence should be at most ε. In this s...
On Spectral Learning of Mixtures of Distributions
We consider the problem of learning mixtures of distributions via spectral methods and derive a tight characterization of when such methods are useful. Specifically, given a mixture-sample, let μi, Ci, wi denote the empirical mean, covariance matrix, and mixing weight of the i-th component. We prove that a very simple algorithm, namely spectral projection followed by single-linkage clustering, ...
PAC Learning Mixtures of Axis-Aligned Gaussians with No Separation Assumption
We propose and analyze a new vantage point for the learning of mixtures of Gaussians: namely, the PAC-style model of learning probability distributions introduced by Kearns et al. [13]. Here the task is to construct a hypothesis mixture of Gaussians that is statistically indistinguishable from the actual mixture generating the data; specifically, the KL divergence should be at most ε. In this s...
PAC Learning Axis-Aligned Mixtures of Gaussians with No Separation Assumption
We propose and analyze a new vantage point for the learning of mixtures of Gaussians: namely, the PAC-style model of learning probability distributions introduced by Kearns et al. [12]. Here the task is to construct a hypothesis mixture of Gaussians that is statistically indistinguishable from the actual mixture generating the data; specifically, the KL divergence should be a...
Clustering Methods for Credit Card using
K-means clustering is a method of cluster analysis that aims to partition n observations into clusters in which each observation belongs to the cluster with the nearest mean. It is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to fin...
Journal title:
- CoRR
Volume abs/0912.0086, issue unspecified
Pages -
Publication date 2009